Storm Data Analysis - California (2000-2024)¶

Table of Contents¶

  • Introduction
  • Load and Check Data
  • Transform Data
  • Exploratory Data Analysis

Introduction¶

This notebook explores the impact of storm events in California from 2000 through 2024. It uses NOAA’s Storm Events Database to analyze event frequency and severity, including injuries, fatalities, and economic damage. The dataset was filtered to include only events with reported consequences and adjusted for inflation to reflect 2024 dollars. The analysis focuses on trends, high-impact events, and patterns across time and event types.

Methods¶

The following data processing steps were taken outside of this notebook:

  • Data was downloaded directly from the NOAA Storm Events Database in a compressed format: https://www.ncei.noaa.gov/pub/data/swdi/stormevents/csvfiles/
  • Only "details" files were used, which include key event attributes such as type, timing, location, injuries, deaths, and damage estimates.
  • After extraction, the data was filtered to include only rows where at least one form of impact (injuries, deaths, or damage) was reported.
  • Only events in California were retained.
  • Relevant columns were kept and data was saved to a single compressed .parquet file for efficient loading and processing in this notebook.
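
The steps above can be sketched as follows. This is a minimal illustration using toy rows, not the actual preprocessing script; the real run used the full column list from the downloaded "details" files.

```python
import pandas as pd

# Toy rows standing in for a NOAA "details" file (real files come from
# the FTP directory linked above); column subset for illustration only.
details = pd.DataFrame({
    'STATE': ['CALIFORNIA', 'CALIFORNIA', 'NEVADA'],
    'EVENT_TYPE': ['Wildfire', 'Heavy Rain', 'Hail'],
    'INJURIES_DIRECT': [1, 0, 0],
    'DEATHS_DIRECT': [0, 0, 0],
    'DAMAGE_PROPERTY': ['30K', None, '5K'],
    'DAMAGE_CROPS': [None, None, None],
})

# Keep only California events with at least one reported impact
ca = details[details['STATE'] == 'CALIFORNIA']
has_impact = (
    (ca[['INJURIES_DIRECT', 'DEATHS_DIRECT']].sum(axis=1) > 0)
    | ca['DAMAGE_PROPERTY'].notna()
    | ca['DAMAGE_CROPS'].notna()
)
cleaned = ca[has_impact]
# cleaned.to_parquet('data/cleaned_storm_data_ca.parquet')  # final step
```

Here only the wildfire row survives: the Heavy Rain row reports no injuries, deaths, or damage, and the Hail row is out of state.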

Notes on Data Quality¶

  • The NOAA Storm Events dataset, while incredibly useful, has known quality issues. A 2016 paper by Renato P. dos Santos (Some comments on the reliability of NOAA’s Storm Events Database) found that:
    • Over half the records lacked damage estimates.
    • Many entries contained errors or used non-standard event types.
    • Narrative fields were inconsistently populated and varied in quality.
  • Since 2016, some work has been done on the data, including standardizing the EVENT_TYPE variable.
  • This analysis likely underestimates total damage:
    • Missing values in damage columns were filled with zeroes.
    • Events without any recorded injuries, deaths, or economic damage were filtered out during preprocessing.
  • As a result, this dataset should be treated as incomplete rather than comprehensive. It's more suitable for exploratory visualizations than precise economic or policy conclusions.

Additional Information¶

  • Monetary figures were adjusted for inflation to 2024 dollars using CPI data from the U.S. Federal Reserve: https://fred.stlouisfed.org/series/CPIAUCSL
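
The adjustment itself is a simple ratio of CPI values. A sketch, using hypothetical CPI figures for illustration (not the actual FRED series values):

```python
# Hypothetical annual CPI values for illustration only
cpi = {2000: 172.2, 2024: 313.7}

def adjust_to_2024(amount, year):
    """Scale a dollar amount from `year` into 2024 dollars."""
    return amount * cpi[2024] / cpi[year]

adjust_to_2024(100_000, 2000)  # roughly $182,000 in 2024 dollars
```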

Links¶

  • Homepage for the Storm Events Database
  • PDF file with descriptions of the data fields

Load and Check Data¶

In [1]:
import os
import re
import pandas as pd

import plotly.express as px
import plotly.graph_objects as go
from plotly.subplots import make_subplots
import plotly.io as pio

from IPython.display import display

# Set default plotly theme
pio.templates.default = "plotly_white"
# Set to "notebook" to reduce file size by factor of 4
pio.renderers.default = "notebook" 
# Format pandas display for floats
pd.set_option('display.float_format', '{:,.0f}'.format)
In [2]:
# Set the path to the cleaned Parquet file
base_dir = os.path.dirname(os.path.abspath("__file__"))  # Use notebook location
data_path = os.path.join(base_dir, "data", "cleaned_storm_data_ca.parquet")

# Load the data
df_raw = pd.read_parquet(data_path)

# Quick check
print(f"Loaded {len(df_raw):,} rows")
df_raw.head()
Loaded 6,974 rows
Out[2]:
STATE EVENT_TYPE CZ_NAME BEGIN_DATE_TIME END_DATE_TIME INJURIES_DIRECT INJURIES_INDIRECT DEATHS_DIRECT DEATHS_INDIRECT DAMAGE_PROPERTY DAMAGE_CROPS EVENT_NARRATIVE
0 CALIFORNIA Wildfire SAN BERNARDINO 06-JAN-00 04:26:00 06-JAN-00 10:00:00 1 0 0 0 30K 0 0
1 CALIFORNIA Dense Fog W RIVERSIDE T X NW 12-JAN-00 05:40:00 12-JAN-00 08:30:00 1 0 0 0 25K 0 0
2 CALIFORNIA Heavy Rain SACRAMENTO 22-JAN-00 18:00:00 24-JAN-00 23:00:00 1 0 0 0 15K 0 0
3 CALIFORNIA High Wind X SW SAN BERNARDINO 05-JAN-00 19:00:00 06-JAN-00 13:30:00 1 0 0 0 200K 0 0
4 CALIFORNIA High Wind W RIVERSIDE T X NW 05-JAN-00 19:00:00 06-JAN-00 13:50:00 1 0 0 0 200K 0 0
In [3]:
# Look at data characteristics
df_raw.describe(include='all')
Out[3]:
STATE EVENT_TYPE CZ_NAME BEGIN_DATE_TIME END_DATE_TIME INJURIES_DIRECT INJURIES_INDIRECT DEATHS_DIRECT DEATHS_INDIRECT DAMAGE_PROPERTY DAMAGE_CROPS EVENT_NARRATIVE
count 6974 6974 6974 6974 6974 6974 6974 6974 6974 6974 6974 6974
unique 1 33 253 6007 5810 44 21 14 6 401 130 5050
top CALIFORNIA Strong Wind SAN BERNARDINO 01-JAN-06 00:00:00 31-DEC-05 23:59:00 0 0 0 0 1.00K 0.00K 0
freq 6974 1618 803 16 14 6339 6820 6478 6826 882 4812 1863
In [4]:
# Check for missing values
df_raw.isnull().sum().sum()
Out[4]:
0

Transform Data¶

Data loaded from the .parquet file stores several fields as strings, so they must be converted to proper data types before analysis.

In this section, we transform the data in the following ways:

  • Convert date and time columns to datetime format
  • Convert any numeric fields to numeric format
  • Convert the dollar-amount damage values from strings (sometimes with the letters K, M, or B) to numeric format
  • Fill missing damage amounts with 0
  • Adjust the damage column for inflation
  • Sum up the direct and indirect injuries, direct and indirect deaths, and property and crop damage into respective totals
  • Drop unused columns
In [5]:
# Combine both damage columns and find non-digit/non-dot characters
all_damage = pd.concat([df_raw['DAMAGE_PROPERTY'], df_raw['DAMAGE_CROPS']]).dropna()

non_numeric_chars = set()
for entry in all_damage:
    non_numeric_chars.update(re.findall(r'[^\d\.]', str(entry)))

print(non_numeric_chars)
{'K', 'B', 'M'}
In [6]:
# Create a deep copy
df = df_raw.copy(deep=True)

# Transform time fields to datetime
df['BEGIN_DATE_TIME'] = pd.to_datetime(df['BEGIN_DATE_TIME'], format='%d-%b-%y %H:%M:%S', errors='coerce')
df['END_DATE_TIME'] = pd.to_datetime(df['END_DATE_TIME'], format='%d-%b-%y %H:%M:%S', errors='coerce')

for col in ['INJURIES_DIRECT', 'INJURIES_INDIRECT', 'DEATHS_DIRECT', 'DEATHS_INDIRECT']:
    df[col] = pd.to_numeric(df[col], errors='coerce')

def parse_damage(value):
    '''
    Parse damage strings, which may end in "K", "M", or "B"
    (thousands, millions, or billions of dollars).
    '''
    if pd.isnull(value) or value in ['0', '0.00K', '0.00M', '0.00B']:
        return 0.0
    try:
        num = float(value[:-1])
        scale = value[-1].upper()
        if scale == 'K':
            return num * 1_000
        elif scale == 'M':
            return num * 1_000_000
        elif scale == 'B':
            return num * 1_000_000_000
        else:
            return float(value)  # fallback if no scale suffix
    except (ValueError, TypeError):
        return None

# Transform damage columns from string to numeric
df['DAMAGE_PROPERTY'] = df['DAMAGE_PROPERTY'].apply(parse_damage)
df['DAMAGE_CROPS'] = df['DAMAGE_CROPS'].apply(parse_damage)

# Fill missing damage with zero
df['DAMAGE_PROPERTY'] = df['DAMAGE_PROPERTY'].fillna(0)
df['DAMAGE_CROPS'] = df['DAMAGE_CROPS'].fillna(0)
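
An equivalent vectorized alternative to the per-row `apply` is sketched below. It assumes every nonzero entry carries a K/M/B suffix, which the character check in In [5] supports; it also maps missing values to 0, matching the `fillna(0)` step above.

```python
import pandas as pd

def parse_damage_vectorized(s: pd.Series) -> pd.Series:
    # Map the suffix letter to a multiplier; anything unrecognized -> NaN
    mult = s.str[-1].str.upper().map({'K': 1e3, 'M': 1e6, 'B': 1e9})
    # Numeric part is everything before the suffix
    num = pd.to_numeric(s.str[:-1], errors='coerce')
    return (num * mult).fillna(0.0)

result = parse_damage_vectorized(pd.Series(['30K', '2.5M', '0.00K', '1B']))
```

On a frame this small the difference is negligible, but vectorized string operations scale much better than `apply` on large datasets.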
In [7]:
# Check time columns for issues
print(df['BEGIN_DATE_TIME'].isnull().sum(), "invalid BEGIN_DATE_TIME entries")
print(df['END_DATE_TIME'].isnull().sum(), "invalid END_DATE_TIME entries")

# Check numeric columns
print(df[['INJURIES_DIRECT', 'INJURIES_INDIRECT', 'DEATHS_DIRECT', 'DEATHS_INDIRECT', 'DAMAGE_PROPERTY', 'DAMAGE_CROPS']].dtypes)
print(df[['INJURIES_DIRECT', 'INJURIES_INDIRECT', 'DEATHS_DIRECT', 'DEATHS_INDIRECT']].isnull().sum().sum())

print(df[['DAMAGE_PROPERTY', 'DAMAGE_CROPS']].isnull().sum())
0 invalid BEGIN_DATE_TIME entries
0 invalid END_DATE_TIME entries
INJURIES_DIRECT        int64
INJURIES_INDIRECT      int64
DEATHS_DIRECT          int64
DEATHS_INDIRECT        int64
DAMAGE_PROPERTY      float64
DAMAGE_CROPS         float64
dtype: object
0
DAMAGE_PROPERTY    0
DAMAGE_CROPS       0
dtype: int64
In [8]:
# Adjust for inflation

# Load CPI data
cpi_df = pd.read_csv("data/CPIAUCSL_yearly.csv")
cpi_df['YEAR'] = pd.to_datetime(cpi_df['observation_date']).dt.year
cpi_dict = dict(zip(cpi_df['YEAR'], cpi_df['CPIAUCSL']))

# Define base CPI
base_cpi = cpi_dict[2024]

# Add YEAR and MONTH columns
df['YEAR'] = df['BEGIN_DATE_TIME'].dt.year
df['MONTH'] = df['BEGIN_DATE_TIME'].dt.month

# Compute inflation adjustment factor for each row
df['CPI_FACTOR'] = df['YEAR'].map(lambda y: base_cpi / cpi_dict.get(y, base_cpi))

# Adjust damage values
df['DAMAGE_PROPERTY_ADJ'] = df['DAMAGE_PROPERTY'] * df['CPI_FACTOR']
df['DAMAGE_CROPS_ADJ'] = df['DAMAGE_CROPS'] * df['CPI_FACTOR']
In [9]:
# Create additional columns
df['TOTAL_INJURIES'] = df['INJURIES_DIRECT'] + df['INJURIES_INDIRECT']
df['TOTAL_DEATHS'] = df['DEATHS_DIRECT'] + df['DEATHS_INDIRECT']
df['TOTAL_DAMAGE_ADJ'] = df['DAMAGE_PROPERTY_ADJ'] + df['DAMAGE_CROPS_ADJ']

# Drop unused columns
df = df.drop(columns=["STATE", "INJURIES_DIRECT", "INJURIES_INDIRECT", "DEATHS_DIRECT", "DEATHS_INDIRECT",
                      "DAMAGE_PROPERTY", "DAMAGE_CROPS", "CPI_FACTOR",
                      "DAMAGE_PROPERTY_ADJ", "DAMAGE_CROPS_ADJ"])
In [10]:
# Check the new data frame
df
Out[10]:
EVENT_TYPE CZ_NAME BEGIN_DATE_TIME END_DATE_TIME EVENT_NARRATIVE YEAR MONTH TOTAL_INJURIES TOTAL_DEATHS TOTAL_DAMAGE_ADJ
0 Wildfire SAN BERNARDINO 2000-01-06 04:26:00 2000-01-06 10:00:00 0 2000 1 1 0 54,654
1 Dense Fog W RIVERSIDE T X NW 2000-01-12 05:40:00 2000-01-12 08:30:00 0 2000 1 1 0 45,545
2 Heavy Rain SACRAMENTO 2000-01-22 18:00:00 2000-01-24 23:00:00 0 2000 1 1 0 27,327
3 High Wind X SW SAN BERNARDINO 2000-01-05 19:00:00 2000-01-06 13:30:00 0 2000 1 1 0 364,358
4 High Wind W RIVERSIDE T X NW 2000-01-05 19:00:00 2000-01-06 13:50:00 0 2000 1 1 0 364,358
... ... ... ... ... ... ... ... ... ... ...
6969 Wildfire SAN BERNARDINO AND RIVERSIDE COUNTY VALLEYS - ... 2024-07-21 00:00:00 2024-07-25 00:00:00 Hawarden wildfire started 130 pm on July 21, 2... 2024 7 0 0 10,000,000
6970 Wildfire SAN DIEGO COUNTY MOUNTAINS 2024-07-24 13:00:00 2024-07-26 00:00:00 Grove wildfire started around 1 pm PDT on July... 2024 7 0 0 30,000
6971 Thunderstorm Wind SAN DIEGO 2024-07-24 12:00:00 2024-07-24 15:00:00 East Warners SDGE site had a gust to 50 mph at... 2024 7 0 0 10,000
6972 Thunderstorm Wind SAN DIEGO 2024-07-24 15:00:00 2024-07-24 17:00:00 Pine Valley Horse mesonet reported gusts 53 mp... 2024 7 0 0 10,000
6973 Thunderstorm Wind SAN DIEGO 2024-07-24 14:00:00 2024-07-24 16:00:00 Shockey Truck Trail SGDE measured wind gust 51... 2024 7 0 0 10,000

6974 rows × 10 columns

Exploratory Data Analysis¶

We now create several plots and tables to help understand the data.

In [11]:
# Define impact type color map
impact_colors = {
    "damage": "firebrick",
    "injuries": "royalblue",
    "deaths": "seagreen"
}

Yearly totals over time¶

Let's plot the total damage, injuries, and fatalities by year.

In [12]:
# Aggregate yearly totals
yearly = df.groupby('YEAR').agg({
    'TOTAL_DAMAGE_ADJ': 'sum',
    'TOTAL_INJURIES': 'sum',
    'TOTAL_DEATHS': 'sum'
}).reset_index()

# Create subplots
fig = make_subplots(rows=1, cols=3, subplot_titles=["Total Damage", "Total Injuries", "Total Deaths"])

# Plot total damage
fig.add_trace(go.Scatter(x=yearly['YEAR'], y=yearly['TOTAL_DAMAGE_ADJ'],
                         mode='lines+markers', name='Damage ($)', line=dict(color=impact_colors["damage"])),
              row=1, col=1)

# Plot total injuries
fig.add_trace(go.Scatter(x=yearly['YEAR'], y=yearly['TOTAL_INJURIES'],
                         mode='lines+markers', name='Injuries', line=dict(color=impact_colors["injuries"])),
              row=1, col=2)

# Plot total deaths
fig.add_trace(go.Scatter(x=yearly['YEAR'], y=yearly['TOTAL_DEATHS'],
                         mode='lines+markers', name='Deaths', line=dict(color=impact_colors["deaths"])),
              row=1, col=3)

fig.update_layout(title_text="Yearly Totals: Damage, Injuries, and Deaths in California (2000-2024)",
                  height=400, width=1200, showlegend=False)

fig.show()

Top 10 event types by impact type (damage, injuries, and deaths)¶

In [13]:
# Aggregate by event type
event_summary = df.groupby('EVENT_TYPE').agg({
    'TOTAL_DAMAGE_ADJ': 'sum',
    'TOTAL_INJURIES': 'sum',
    'TOTAL_DEATHS': 'sum'
}).reset_index()

# Get top 10 for each category
top_damage = event_summary.nlargest(10, 'TOTAL_DAMAGE_ADJ')
top_injuries = event_summary.nlargest(10, 'TOTAL_INJURIES')
top_deaths = event_summary.nlargest(10, 'TOTAL_DEATHS')

# Create subplots
fig = make_subplots(
    rows=1, cols=3,
    subplot_titles=["By Damage (log scale)", "By Injuries (log scale)", "By Deaths"],
    horizontal_spacing=0.12)

# Bar for damage
fig.add_trace(go.Bar(
    x=top_damage['EVENT_TYPE'],
    y=top_damage['TOTAL_DAMAGE_ADJ'],
    name='Damage',
    marker_color=impact_colors['damage']
), row=1, col=1)

# Bar for injuries
fig.add_trace(go.Bar(
    x=top_injuries['EVENT_TYPE'],
    y=top_injuries['TOTAL_INJURIES'],
    name='Injuries',
    marker_color=impact_colors['injuries']
), row=1, col=2)

# Bar for deaths
fig.add_trace(go.Bar(
    x=top_deaths['EVENT_TYPE'],
    y=top_deaths['TOTAL_DEATHS'],
    name='Deaths',
    marker_color=impact_colors['deaths']
), row=1, col=3)

fig.update_layout(
    title_text="Top 10 Event Types by Damage, Injuries, and Deaths in California (2000-2024)",
    height=400,
    width=1200,
    showlegend=False)

# Set y to log scale for Damage and Injuries
fig.update_yaxes(type='log', row=1, col=1)
fig.update_yaxes(type='log', row=1, col=2)

fig.show()

Event type by frequency over time¶

Let's plot how often the top 5 most frequent event types occur each year.

In [14]:
# Count total number of events per EVENT_TYPE
most_common_types = df['EVENT_TYPE'].value_counts().nlargest(5).index.tolist()

# Filter for those event types
filtered_df = df[df['EVENT_TYPE'].isin(most_common_types)].copy()

# Group and count
freq_by_year = filtered_df.groupby(['YEAR', 'EVENT_TYPE']).size().reset_index(name='COUNT')

# Plot
fig = px.line(
    freq_by_year,
    x='YEAR',
    y='COUNT',
    color='EVENT_TYPE',
    title='Event Type Frequency Over Time (Top 5 Types), California (2000-2024)',
    markers=True,
    color_discrete_sequence=px.colors.qualitative.Dark2
)

fig.update_layout(height=400, width=1200, legend=dict(orientation="h", yanchor="bottom", y=1.02, xanchor="center", x=0.5))
fig.update_traces(opacity=0.9)
fig.show()

Bubble plot of damage vs fatalities by event type¶

In [15]:
# Group and summarize, including start date
bubble_data = df.groupby('EVENT_TYPE').agg({
    'TOTAL_DAMAGE_ADJ': 'sum',
    'TOTAL_DEATHS': 'sum',
    'TOTAL_INJURIES': 'sum',
    'BEGIN_DATE_TIME': 'min'  # Earliest event date
}).reset_index()

bubble_data['BEGIN_DATE_TIME'] = bubble_data['BEGIN_DATE_TIME'].dt.strftime('%Y-%m-%d')


# Plot with labels
fig = px.scatter(
    bubble_data,
    x='TOTAL_DAMAGE_ADJ',
    y='TOTAL_DEATHS',
    size='TOTAL_INJURIES',
    hover_name='EVENT_TYPE',
    hover_data={'BEGIN_DATE_TIME': True},
    text='EVENT_TYPE',  # Show event names as labels
    title='Damage vs. Fatalities by Event Type (Bubble Size = Injuries), California (2000–2024)',
    labels={
        'TOTAL_DAMAGE_ADJ': 'Total Damage (Adj $)',
        'TOTAL_DEATHS': 'Total Deaths',
        'TOTAL_INJURIES': 'Total Injuries',
        'BEGIN_DATE_TIME': 'Start Date'
    },
    height=500,
    width=1200,
    color_discrete_sequence=['#1f77b4']
)

fig.update_traces(
    marker=dict(opacity=0.6, line=dict(width=1, color='DarkSlateGrey')),
    textposition='top center'  # Position labels above bubbles
)
fig.update_layout(xaxis_type='log', yaxis_type='log')
fig.show()

High-impact events tables¶

Here we create three tables showing the top five most severe events by total damage, total deaths, and total injuries.

In [16]:
top_damage_events = (
    df[['BEGIN_DATE_TIME', 'EVENT_TYPE', 'TOTAL_DAMAGE_ADJ', 'EVENT_NARRATIVE']]
    .sort_values(by='TOTAL_DAMAGE_ADJ', ascending=False)
    .head(5)
    .copy()
)

top_death_events = (
    df[['BEGIN_DATE_TIME', 'EVENT_TYPE', 'TOTAL_DEATHS', 'EVENT_NARRATIVE']]
    .sort_values(by='TOTAL_DEATHS', ascending=False)
    .head(5)
    .copy()
)

top_injury_events = (
    df[['BEGIN_DATE_TIME', 'EVENT_TYPE', 'TOTAL_INJURIES', 'EVENT_NARRATIVE']]
    .sort_values(by='TOTAL_INJURIES', ascending=False)
    .head(5)
    .copy()
)

display(top_damage_events)
display(top_death_events)
display(top_injury_events)
BEGIN_DATE_TIME EVENT_TYPE TOTAL_DAMAGE_ADJ EVENT_NARRATIVE
4863 2018-11-08 06:33:00 Wildfire 21,238,016,726 The Camp Fire began on the morning of November...
3595 2014-12-01 00:00:00 Drought 1,987,820,797 Impacts to the region included: Reservoirs dro...
4920 2018-07-23 15:00:00 Wildfire 1,873,942,652 The Carr Fire began on the afternoon of July 2...
609 2003-10-25 17:37:00 Wildfire 1,784,157,375 0
608 2003-10-25 09:00:00 Wildfire 1,187,278,735 0
BEGIN_DATE_TIME EVENT_TYPE TOTAL_DEATHS EVENT_NARRATIVE
4863 2018-11-08 06:33:00 Wildfire 86 The Camp Fire began on the morning of November...
1622 2006-07-16 15:00:00 Heat 30 Fresno/Madera/Merced
4921 2018-01-09 04:00:00 Debris Flow 21 Intense rainfall generated a large and deadly ...
5430 2020-09-07 21:00:00 Wildfire 16 Immediate evacuation orders where given to Ber...
1624 2006-07-16 15:00:00 Heat 16 Tulare
BEGIN_DATE_TIME EVENT_TYPE TOTAL_INJURIES EVENT_NARRATIVE
4921 2018-01-09 04:00:00 Debris Flow 168 Intense rainfall generated a large and deadly ...
609 2003-10-25 17:37:00 Wildfire 90 0
5455 2020-09-05 08:00:00 Excessive Heat 77 Parts of the area rose up to 110 to 115 degree...
3791 2014-10-04 12:00:00 Heat 75 One of the local television stations and San D...
5393 2020-07-12 09:00:00 Wildfire 68 One of the Navy's vessels in the San Diego Bay...